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Abstract 

Genetic interaction measures how different genes collectively contribute to a pheno- 
type, and can reveal functional compensation and buffering between pathways under 
genetic perturbations. Recently, genome-wide investigation for genetic interactions has 
revealed genetic interaction networks that provide novel insights both when analyzed 
independently and when integrated with other functional genomic datasets. For higher 
eukaryotes such as human, the above reverse-genetics approaches are not straightfor- 
ward since the phenotypes of interest for higher eukaryotes such as disease onset or 
survival, are difficult to study in a cell based assay. 

In this paper, we propose a general framework for constructing and analyzing hu- 
man genetic interaction networks from genome-wide single nucleotide polymorphism 
(SNP) datasets used for case-control studies on complex diseases. Specifically, we 
propose a general approach with three major steps: (1) estimating SNP-SNP genetic 
interactions, (2) identifying linkage disequilibrium (LD) blocks and mapping SNP-SNP 
interactions to LD block-block interactions, and (3) functional mapping for LD blocks. 
We performed two sets of functional analyses for each of the six case-control SNP 
datasets used in the paper, and demonstrated that (i) genes in LD blocks showing 
similar interaction profiles tend to be functionally related, and (ii) the network can 
be used to discover pairs of compensatory gene modules (between-pathway models) 
in their joint association with a disease phenotype. The proposed framework should 
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provide novel insights beyond existing approaches that either ignore interactions be- 
tween SNPs or model different SNP-SNP pairs with genetic interactions separately. 
Furthermore, our study provides evidence that some of the core properties of genetic 
interaction networks based on reverse genetics in model organisms like yeast are also 
present in genetic interactions revealed by natural variation in human populations. 

1 Introduction 

Genetic interaction measures how different genes collectively contribute to a phenotype, and 
can reveal functional compensation and buffering between pathways (or protein complexes) 
under genetic perturbations. Recently, genome-wide screening of genetic interactions have 
become possible for different species via high-throughput methods [8] in which the phenotypic 
effect of the double-knockout of each pair of genes are compared with the aggregated effects 
of the two individual knockouts under an assumption of independence. An extreme case, 
called synthetic lethality, occurs when a double knockout results in the death of a cell even 
when the single knockouts have no effect. The genetic interaction networks revealed by 
these experiments provide novel insights both when analyzed by themselves [30] and when 
integrated with other molecular networks or genomic datasets |8] , such as physical interaction 
[20] . gene expression [17] and chemical-genetic interaction data [27]. For higher eukaryotes 
such as human, these reverse-genetics approaches have not been as straightforward due to 
both less amenable genetics and more complex phenotypes of interest such as disease onset 
and survival, which are difficult to study in cell based assays. 

Despite slow progress in mapping genetic interactions using reverse genetics approaches 
in human cells, we are accumulating a wealth of data on individual genetic variations in 
human populations. Genome-wide association studies (GWAS) capturing single-nucleotide 
polymorphisms (SNPs) or copy number variations ^25j that have been widely applied for 
studying genetic differences between disease samples (cases), and normal samples (controls) 
|18j . offer an alternative means to map genetic interactions. For example, if two genetic 
variants have weak individual association with a disease but very strong joint association, 
the genes controlled by the two variants may have compensating functions that can buffer 
variations in each other, but yield a much higher risk of the disease phenotype of interest 
for a joint mutation. 

Indeed, the genetic interactions between genetic variants (such as SNPs and CNVs) have 
been extensively studied as statistical epistasis (used interchangeable with genetic interac- 
tion in this paper) p] in both genome-wide [7] and targeted studies [261 lEl E] ■ However, 
existing approaches mostly consider pairs (or high-order combinations) of genetic variants as 
separate interaction candidates, and hence estimate their statistical significance and study 
their biological interpretations in an isolated manner [7] . While many statistically significant 
and biologically relevant instances of epistasis are discovered, these approaches may have the 
following drawback. Several pairs of genetic variants may not have statistically significant 
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genetic interactions when considered as individual interaction candidates, but nevertheless, 
can be collectively significant if they are highly enriched for two pathways that have com- 
plementary functions (a between pathway model, BPM [2Q]). This limitation motivates the 
design of approaches that can model the collective significance of SNP pairs based on their 
network structure. 

Specifically, we aim to construct human disease-specific (cases vs. controls) genetic in- 
teractions from GWAS case-control datasets and then discover BPMs from the constructed 
network, i.e., two sets of genetic variants that have many genetic interactions across the two 
sets but none or very few within either set. For the community doing research on genetic 
interaction, the novelty is the exploration of whether the biologically interesting structures 
(e.g., BPMs) discovered from the genetic interaction networks of lower eukaryotes such as 
yeast also exist in complex diseases, such as cancer and neurological diseases, for higher 
eukaryotes such as humans. For the community doing research on GWAS data analysis, this 
work provides additional evidence to support the shift from the analysis of single genes (or 
SNPs) to sets of genes (e.g., gene sets [23 E2] or protein interaction subnetworks \3[ 
Existing approaches that focus on the discovery of individually significant pathways or sub- 
networks ignore those pathways or subnetworks that are individually insignificant but are 
compensatory to each other as a combination (e.g., pairs of pathways or subnetworks) in 
their strong association with a disease phenotype. 

In this paper, we propose a general framework for constructing human disease-specific 
genetic interaction networks with GWAS data. Because different types of genomic data have 
unique characteristics that need to be addressed, we focus on genome-wide case-control SNP 
data and its accompanying linkage disequilibrium (LD) structure. We discuss the challenges 
in the construction of genetic interaction networks due to LD structure and propose a general 
approach with three steps: (1) estimating SNP-SNP genetic interactions, (2) identifying LD 
blocks and summarizing SNP-SNP interactions to LD block-block genetic interactions, and 
(3) functional mapping (e.g. gene mapping) for each LD block. 

To illustrate how the constructed genetic interaction network can be used to obtain both 
known and novel biological insights about disease phenotype of interest in the case-control 
study we designed two sets of functional analyses (/ and //) to analyze the genetic interaction 
networks constructed on each of the six case-control SNP datasets used in this study. 

For functional analysis /, we study whether a constructed human genetic interaction net- 
work has functional significance with respect to independent biological databases. Specifi- 
cally, we compare the LD block-block genetic interaction network and the genetic-interaction- 
profile-based LD block-block similarity network with the human functional network inte- 
grated in [T^. Interestingly, we find that the pairs of LD blocks that have high genetic 
interaction and those pairs that have high similarity of genetic interaction profiles have sig- 
nificantly higher functional similarity. This motivates the potential utility of the constructed 
genetic interaction network for revealing both known and novel biological insights into the 
disease phenotype of interests in a case-control study. 

For functional analysis //, we study how to use the constructed human genetic interaction 
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network to provide detailed insights about the compensation between pathways in their joint 
association with a disease phenotype. Specifically, we discover between pathway models 
(BPM) from the block-block genetic interaction network. A BPM contains two sets of 
LD blocks, which have many cross-set genetic interactions but very few within-set genetic 
interactions. The experiments on the six SNP datasets demonstrate that the discovered 
BPMs have statistically significant properties (supported by permutation tests of case-control 
groupings) such as across-set densities of genetic interactions and functional enrichments 
based on three sets of biological databases. The significant BPMs may provide indications 
of the compensation between pathways (or protein complexes) in their association with the 
disease phenotypes, and serve as a novel type of biomarker for complex diseases. 

Comprehensive experimental results on the six case-control SNP datasets support several 
points: (i) From the perspective of genetic interaction analysis, the constructed human ge- 
netic interaction network has functional significance, and the biologically interesting motifs 
such as BPM that are common in lower eukaryotes also exist in human with respect to com- 
plex diseases such as cancer and Parkinson's disease; (ii) From the perspective of GWAS data 
analysis and biomarker discovery, discovering BPMs from the constructed human genetic in- 
teraction network can help reveal novel biological insights about complex diseases, beyond 
existing approaches for GWAS data analysis that either ignore interactions between SNPs, or 
model different SNP-SNP genetic interactions separately rather than studying global genetic 
interaction networks as in this study. 

2 Methods 

We first present a general framework for constructing genetic interaction networks from 
genome-wide case-control SNP datasets, and then describe two approaches for the functional 
analysis of the resulting networks that can be used to discover novel biological insights about 
complex diseases. For each step in this process, we selected a particular approach. Various 
alternatives are possible, but due to the limitation of space and the desire to clearly explain 
our general approach, we do not discuss those alternatives in any detail. However, given the 
significant results obtained by the current approach consistent over six datasets (Section [3]), 
some of these alternatives should be explored to see if additional improvements in the results 
are possible. 

2.1 Network Construction 

There are three steps in the network construction framework: (i) measuring all pairwise 
SNP-SNP genetic interaction with respect to the case-control grouping in a GWAS dataset, 

(ii) summarizing SNP-to-SNP interactions into a block-level genetic interaction network, and 

(iii) functional mapping for each block. 
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2.1.1 SNP-to-SNP Genetic Interaction 



The principal goal of measuring genetic interactions between two SNPs is to capture the 
non-additive effect between the two SNPs in their combined association with the phenotype 
of interest. For this purpose, we leverage the extensive research on statistical epistasis [6] 
that has been recently reviewed by H. Cordell (7]. 

Among the different measures for epistasis, we selected the information theoretic synergy 
[1]. The synergy between two SNPs with respect to a binary class label variable (cases vs. 
controls), i.e., Sync{Si, Sj) is defined in [1] as follows: 

SynciS,, S,) = I{S,, S,; C) - /(S,; C) - 1(8,; C), (1) 
where I{X; C) denotes the mutual information between the class variable C and a variable 

X 

(2) 

In this paper, synergy and mutual information are always normalized by H{C), after 
which, Sync{Si, Sj) ranges from —1 to +1. The focus of this paper is on positive synergy 
since we want to measure the interaction between two SNPs beyond the independent additive 
effect of their joint association with a case-control phenotype. The interpretation of a positive 
synergy, e.g. 0.2, is that the two SNPs as a combination provide 20% more information about 
C beyond the summation of information provided independently by the two SNPs. In this 
paper, we say two SNPs have a positive genetic interaction if they have a positive synergy 
(beyond-additive effect) to keep the sign of genetic interaction synergy consistent. Note 
that, in reverse-genetics based yeast genetic interaction, negative genetic interaction is used 
to denote beyond-additive effect. 

We chose to compute synergy for all pairs of SNPs rather than just those pairs for which 
one SNP has sufficiently large marginal effects [H] since we did not want to risk missing 
SNP pairs that have weak (or no) marginal effect but a strong combined effect as discussed 
in [34] since these pairs are essential for building an interaction network that may have even 
better statistical power than other approaches. After the synergy calculation for all pairwise 
SNPs, we get a full weighted SNP-SNP network. We denote this matrix as Ml, as shown in 
Figured]). This network cannot be directly interpreted because of the LD structure in the 
SNP data. In the next section, we discuss this challenge in detail and present an approach 
to address it. 

2.1.2 LD Block-block Genetic Interaction Network 

Due to the LD structure in SNP data, nearby SNPs tend to have correlated genotypes over 
the samples. Therefore, if a pair of SNPs {Si and Sj) have large synergy, the SNPs close 
to Si probably also have large synergy with the SNPs close to Sj. This can result in a 
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Figure 1: Flowchart of constructing genetic interaction networks from case-control datasets. 
Three steps: (1) Calculating pairwise SNP-SNP genetic interactions, yielding Ml; (2) Iden- 
tifying LD blocks; and (3) summarizing a SNP-SNP Network to a block-block genetic inter- 
action network (M2). 



trivial type of local motif (approximate bicliques) in the SNP-SNP network as illustrated in 
Figure [21 This biases the functional analysis since such bicliques do not reflect the functional 
similarity between the SNPs in the same LD block. In order to gain non-trivial insights from 
the genetic interaction network, we propose to summarize the SNP-SNP interaction network 
by an LD block-block network. 

Identifying LD blocks: Different measures for estimating LD such as and D' [9] are 
specifically designed for measuring the non-random associations between polymorphisms at 
different loci. These measures capture the difference between observed and expected allelic 
frequencies (assuming random distributions), which depend on the phase information and 
define an LD block within a genomic region. With a related but different purpose, our goal 
in this study is to identify a set of SNPs on the same chromosome having similar genotype 
profiles and use a single block to represent them. 
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Figure 2: Illustration of a trivial biclique pattern due to LD structure in SNP data and how 
it is summarized into a representative block-block edge. Such biclique does not reflect the 
functional similarity between the SNPs in the same LD block. There should be only one 
edge connecting two functional units. 

We use Hamming similarity to measure the correlation between the genotype profiles of 
SNPs. (SNPs are columns in the case-control data as illustrated in Figured]) This similarity 
serves better for our purpose of measuring mathematical similarity between two SNPs rather 
than the LD. We take such a conservative approach in order to make sure that we do not 
create two separate LD blocks for SNPs that have similar genotype profiles. This also avoids 
the estimation of phase information, which adds additional uncertainty that may confound 
the analysis. We do not restrict an LD block to be within a local genomic region. This is 
because SNPs that are far from each other can also have high genotype correlation, at least 
from the mathematical perspective as shown in [21]. Again, we take such a conservative 
approach in order to make sure that we do not create two separate LD blocks for SNPs that 
have similar genotype profiles, which may happen if a window-size constraint is used. For 
simplicity, we perform a greedy search of LD blocks. Specifically, we randomly take a SNP 
and combine it with all the other SNPs (on the same chromosome) with Hamming similarity 
above a threshold h [h = 0.7 is used in this study) as an LD block. A SNP will only be 
assigned to one LD block. 

SNP-SNP Network to LD Block-Block Network: 

After identifying all the LD blocks in a dataset, we have a many-to-one mapping from 
all the SNPs to a set of blocks. Given this mapping, we summarize the SNP-SNP synergy 
network to a block-block synergy network using the following general function to estimate 
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the synergy between two blocks Bi and 5^, 

Sync{Bi, Bj) = s,ciB,,s,&B,Sync{Si, Sj) (3) 

where \I/ denotes a general aggregation function, e.g. max or mean. In this paper, we 
adopt the max function based on the following reasons and observations: (1) Biologically, 
it is likely that only one pair of SNPs across two LD blocks are truly causative in the 
case-control phenotype, in which case max is the ideal aggregation function. (2) Based 
on multiple datasets used in the experiment section, the max function consistently yields 
coherence with existing biological knowledge gained from yeast genetic interaction networks. 
(3) In the sanity check, the pairs with top synergy values have similar LD-block sizes as the 
null distribution, and thus are not due to the bias of large LD block sizes. There are other 
aggregation approaches [El [28] that we will explore in future work. 

After this step, we have a block-block genetic interaction network, which we denote as 
M2 as illustrated in Figure [H 

Functional Mapping for each LD Block: After the construction of an LD block-block 
genetic interaction network, a functional mapping for each block is required to interpret the 
structure of the interaction network in functional terms that have biological meaning. For 
this purpose, we first assign each SNP to the closest gene based on its genome location. Then, 
the genes of an LD block are obtained from the SNPs that were assigned to that block in 
the LD identification step. We will explore more advanced gene mapping strategies in future 
work. Interestingly, even with this simple gene mapping approach, the functional analysis 
in Section [3] shows that the constructed LD block-block interaction network still appears 
to have functional structure. Note that the gene mapping does not result in a gene-gene 
interaction network. This is because an LD block can span multiple genes so that gene-gene 
network derived from the blocks would contain many trivial biclique patterns. From this 
perspective, the genetic interaction network constructed for yeast in [13] from eQTL data 
may have included a large number of false positive gene-gene edges since it connected all the 
gene pairs from the two LD blocks. In contrast, the block-block network constructed in this 
study has the least amount of bias from trivial bicliques. 

2.2 Network Analysis 

In Section 12. ![ we presented a framework for constructing a block-block genetic interaction 
network from a human case-control genomic dataset. In this subsection, we present two sets 
of functional analyses: /, which Comparing the constructed LD block-block genetic interac- 
tion network and the corresponding similarity network derived from the human functional 
network, and //, which discovers between pathway models (BPM) from the LD block-block 
genetic interaction network for functional enrichment analysis [20]. Both types of analy- 
sis have been used in the systematic interpretation of double-knockout-based yeast genetic 
interaction networks [SI ED]- Here we use a similar approach to reveal novel biological in- 



8 



M3: Binarized 

Pairwise 
Block-Block 
Genetic Interaction 



Functional Anaiysis f 



Three sources of bioiogicaf background knowledge 



Pairwise Jaccard 
similarity calculation 
from interaction matrix 



M4: Pairwise 
Biock-Block 

Similarity based on 
the profile of 

Genetic Interaction 



Discovering 
Between 
Pathway Model 
(BPM) 



□ 



□ 



D 4= 




M6: Block-Block 
Pairwise 
Functional 
Similarity Network 



, J 

p" Gene-gene funcSonal 
similarly fo bfock-block 
functional similanty 

Genes 



M5: Human 
gene-gene 

pairwise 
Functional 

Network 



Human 
GO Annotation 



Human 
Molecular 
Signature 
Database 
(MSigDB) 



Functional Analysis U 



Figure 3: Flowchart of the functional analysis on the binarized genetic interaction profiles 
M3. There are two types of analysis: (/) constructing the block-block similarity network 
(M4) from M3 and enrichment analysis, and {II) discovering BPMs (between pathway mod- 
els) from M3 and enrichment analysis. Both enrichment analyses are with respect to three 
sources of biological knowledge, namely (i) the block-block functional similarity network 
(M6) derived from the human functional network M5, (ii) the human GO annotations and 
(iii) the human molecular signature database (MSigDB). 

sights from the genetic interaction networks constructed from human case-control genomic 
datasets. Figure [3] shows the overall design of the two types of functional analysis and will 
be referenced extensively in the rest of this section. 

An important point in understanding following discussion is that we first threshold the 
block-block genetic interaction matrix (M2) to a binary matrix (M3). Specifically, we binarize 
this network with a quantile threshold q (e.g. 1%), such that those block-block edges with 
synergy in the top q quantile (those with large beyond-addtive effect interactions) are kept in 
the binary network. We denote the matrix representation of this network as M3 as illustrated 
in Figure [3l Both of the analyses make use of binarized matrix M3. 

2.2.1 Functional analysis /: Comparing the constructed LD block-block ge- 
netic interaction network and the corresponding similarity network de- 
rived from a human functional network 

In the first approach, we study whether the constructed human genetic interaction network 
has functional significance supported by independent biological databases. Specifically, we 
first use Jaccard similarity to measure the similarity between the profiles of genetic interac- 
tions of two blocks in M3. This results in matrix M4 as shown in Figure [3l The motivation is 
that the analysis of global yeast genetic interaction networks [8] has shown that the similarity 



9 



between such interaction profiles of two genes is correlated with the functional similarity of 
two genes. In the experimental section, we will use the constructed block-block similarity 
matrix M4 to compare the similarity of block interaction profiles with the human gene-gene 
functional network integrated in |19] . 

2.2.2 Functional analysis //: Discovering BPMs (between pathway models) 
from the LD block-block genetic interaction network for functional en- 
richment analysis 

Functional analysis approach II, takes a complementary approach to discover between path- 
way models (BPM). (This approach was demonstrated to be effective in the analysis of yeast 
genetic interaction networks jSQ]-) Using insights gained from yeast genetic interaction net- 
work, a BPM contains two sets of genes that have many across-set genetic interactions and 
within-set protein-protein interactions [20] . The two sets of genes in a BPM may correspond 
to two biological pathways (or protein complexes) that have redundant (or complementarity) 
biological functions with respect to the case-control grouping. In the context of this study, 
a BPM contains two sets of LD blocks, which have many large cross-set genetic interactions 
(synergy values) in M3 and very few within-set genetic interactions. 

Discovering BPMs from the binarized disease-specific genetic interaction matrix (M3) 
can provide novel insights beyond existing approaches designed for analyzing case-control 
SNP datasets. Recently, existing approaches for analyzing case-control datasets have shifted 
from discovering single genes (or SNPs etc) to sets of genes (e.g. pathways [29] or protein 
interaction subnetworks [1]). While many statistically significant and biologically relevant 
gene sets or subnetworks are discovered, existing approaches may ignore those pathways 
or subnetworks that are individually insignificant but are compensatory to each other as 
a combination in their strong association with a disease. From this perspective, a BPM 
captures the compensation between two pathways in their combined association with complex 
diseases such as cancer that may be caused by the perturbation of multiple (e.g., a pair of) 
pathways but not individual pathways. This may lead to the discovery of a new type of 
complex biomarker, i.e. pairs of compensating pathways or protein complexes. 

Many algorithms have been proposed to discover BPMs from yeast genetic interaction 
networks. These algorithms mostly depend on integration of both physical interaction net- 
work and genetic interaction network [201 EI]- Such integrative approaches have the ad- 
vantage of better interpretability because of the integration with physical interaction data. 
However, given our goal of discovering BPMs from human case-control datasets and the fact 
that human protein interaction network is not as complete and lacks reproducibility [19], an 
integrative approach may miss many BPMs that are not yet well-supported by the existing 
limited functional knowledge of the human genome. 

For the above reason, in this study, we took an exploratory approach in which, we search 
for BPMs only from the genetic interaction matrix M3. Given a binary symmetric matrix 
(M3), the BPM discovery problem can be reformulated as a quasi-biclique discovery problem 



10 



A quasi-biclique is defined as a non-weighted bipartite graph G = {X U Y, E) (where X 
and Y are two sets of LD blocks and E is a, set of LD block-block genetic interaction edges) 
such that, for a given parameter < 5 < 0.5. 



where d{x, Y) denotes the number of edges between x and all the nodes in y (similarly for 
d{y,X)). In this paper, we adopt a greedy algorithm with polynomial time complexity [23] 
to efficiently search for quasi-blciques from the binary block-block network (M3). Note that, 
several other algorithms that are designed for quasi-biclique discovery or BPM discovery [24J 
can also be applied for the same purpose. These will be explored in future work as this paper 
focuses on presenting the overall framework. 

Based on the definition, a quasi-biclique may also have many edges within each of the 
two sets {X and Y), while BPMs that with no or very few within-set edges are relatively 
more interesting. Therefore, after the discovery of a set of quasi-bicliques, a postprocessing 
will be applied to further select a subset of bicliques with a small fraction of within-set ge- 
netic interactions as BPM candidates for the following functional analysis. We will design 
experiments to interpret the discovered BPMs with human functional network [19] and con- 
duct enrichment analysis with human GO annotations [2] and human molecular signature 



3 Results 

In this section, we present the experimental results on the two sets of functional analyses 
(described in Section 12. 2p on the network constructed with the framework presented in 
Section [2IU 

3.1 Data Sets 

In the experiments, we use six case-control SNP datasets, one from a genome- wide association 
study on Parkinson's disease (disease vs. normal) [13j, and the others from targeted studies 
(with around 3000 SNPs over about 1000 genes) on Myeloma long vs. short survival [26], 
Myeloma cases vs. controls fT5], lung cancer cases vs. controls (both heavy smokers) [5], 
rejection vs. no- rejection after kidney transplant [T6|, bone disease (large vs. small number 
of bone lesions ) [26|. We denote these 6 datasets as Parkinson, Myeloma-survival, Myeloma, 
Lung, Kidney, Bone in this section. For the genome-wide Parkinson data, we selected 8994 
non-synonymous SNPs for this study for the concern of computational efficiency, given the 
huge number of all pairwise SNP-SNP genetic interaction to compute, especially in a large 
number (100) of permutations for each dataset (functional analysis II). 



WxeX,d{x,Y) > {l-6)\Y\,\fy eY,d{y,X) > {1 - 6)\X 



(4) 




11 



J—-' Oj u OjOC u 


ti of 


of 


li of 


tt of LD 


fj of B-B 




Cases 


Controls 


SNPs 


Blocks 


Edges 


Parkinson 


270 


271 


8994 


5011 


31387 


Myeloma- survival 


70 


73 


3077 


986 


7291 


Lung 


99 


99 


2990 


977 


7159 


Kidney 


135 


136 


2864 


963 


6955 


Myeloma 


244 


149 


2345 


804 


4848 


Bone 


73 


185 


2982 


993 


7395 



Table 1: Information about the datasets. The number of blocks depends on the hamming 
similarity threshold h{0.7 used in this paper). The number of LD block-block (B-B) edges 
depends on the bianrization threshold q (1.5% for all the datasets except Parkinson, for 
which we used 0.25% to speed up the BPM discovery in functional analysis II.) 

Table [T] displays the information about the six datasets, including the number of cases, 
controls, SNPs, LD blocks identified and edges in an LD block-block genetic interaction 
network (M3). 

Three sources of existing knowledge are used for functional analysis: (a) a human func- 
tional network (only edges with w > 0.5 are kept, where w is the functional similarity 
between two genes ranging from to 1), (b) human GO annotations [2] and (c) biological 
gene sets in the human molecular signature database (MSigDB[29]) [f] 

3.2 Functional analysis I: Biological significance of the constructed 
genetic interaction network 

In this section, we investigate if the constructed human genetic interaction network has 
functional significance supported by independent biological data. Specifically, we check the 
functional similarity of the top LD block-block pairs in the genetic interaction network 
(M2) and the similarity matrix M4 with the human functional network (M5) integrated in 
|19] . Because M5 contains pair- wise gene relationships, a similar transformation as step 3 in 
Figure [T]is needed to derive an LD block-block functional similarity network (Mg). Similar to 
Equation [31 we summarize the gene-gene functional similarity network to an LD block-block 
functional similarity network with an aggregation function e.g. max, mean etc. We used 
the max function for the similar theoretical reasons and empirical observations as discussed 
for Equation [3l We note that there are many caveats with this choice but again, we keep 
the overall framework as the focus of this paper, and alternative aggregation functions will 
be explored further in future work. 

^ http: / /sonorus. princeton.edu/hefalmp / 

^Downloaded on Dec 29, 2010 from |http: / /geneontology.org/ 
•^Version 3.0 littp://www. broadinstitute.org/gsea/msigdb/ 
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Figure H] shows the resuhs on comparing the functional similarity of top pairs of LD 
blocks in genetic interaction network (M2) and genetic interaction profile similarity network 
(M4) with the human functional network. The following consistent observations can be made 
over all the six datasets 0: 

1. LD blocks with similar interaction profiles tend to contain functionally 
related genes (Bar 1 is significantly higher than Bar 3): Specifically, if a pair of 
LD blocks have similar profile of genetic interactions, they tend to have significantly 
stronger functional relationship compared to the null distribution for all LD block- 
block pairs. This is biologically meaningful because similar genetic interaction profiles 
indicates probable common membership in a pathway or a protein complex, which 
agrees with observations on the yeast genetic interaction network [8]. 

2. Interacting LD blocks tend to contain functionally related genes (Bar 2 is 
significantly higher than Bar 3): If a pair of LD blocks have high genetic interaction, 
they tend to have significantly strong functional relationship compared to the null 
distribution for all LD block-block pairs. This is biologically meaningful because high 
genetic interaction implies the functional complementarity between two pathways (or 
protein complexes) . This also agrees with observations on the yeast genetic interaction 
network. 

3. Interaction profile similarity is more functionally specific than direct genetic 
interaction does (Bar 1 is significantly higher than Bar 2): This indicates that, if a 
pair of LD blocks have similar profiles of genetic interactions, they tend to contain genes 
with stronger functional relationships compared to those that simply have a genetic 
interaction. This is expected because biological function is relatively more coherent 
for members of the same process or pathway than two complementary (genetically 
buffered) pathways. A similar trend has been observed in the yeast genetic interaction 
network. 

These results suggest genetic interaction networks derived from human genetic association 
studies does produce networks with functional coherence, and this result is robust across the 
six studies we evaluated. We should note that the choice of max over mean for mapping 
the SNP-SNP network to LD block network and mapping the functional network to the LD 
block network does make a difference: the same result is not observed when choosing the 
mean, which suggests that the choice of how to summarize LD block-level statistics is a 
critical one. We note that other more sophisticated summary measures aside from the max 
will likely be superior but leave this for future work. 



"^Two are shown in Fig. |4l and the others on the website 
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3.3 Functional analysis II: Discovering and interpreting BPMs 
from the constructed genetic interaction network 

In this subsection, we present the results of functional analysis 11, i.e. BPM discovery and 
enrichment analysis. 

Given a case-control dataset, a set of BPMs can be discovered from the correspond- 
ing binarized block-block genetic interaction network M3. For each BPM, we compute the 
following measures to capture its various properties: 

1. Sizes: We use three measures to describe the size of a BPM, Ssumi^-^- \X\ + |^|), 
Smaxi^-^- max{\X\, \Y\)) and S'mj„(i.e. m.in{\X\, \Y\)), where X and Y are the two set 
of LD blocks in a BPM, as used in Equation HI 

2. Genetic interaction density as a quasi- biclique: We use three measures Dxx, Dxy, 
Dyy which are the genetic interaction density of the submatrices M^{X, X), M^lX, Y) 
and M3(y, y), respectively. Dxy says the genetic interaction density across the two 
sets of LD blocks, as an indicator of the strength of genetic interaction between the 
two sets. Dxx and Dyy tell the density within each of the two sets, which will be 
used to select a subset of BPMs with small density within the two sets as described in 
Section EXl 

3. Functional coherence with respect to the human functional network: We use three 
measures Fxx, Fxy, Fyy which are the average functional similarity in the subma- 
trices Mq{X, X), Mq{X^ Y) and Mq{Y^ F), respectively. Fxx and Fyy indicate the two 
functional coherence within the two sets of LD blocks in a BPM, and Fxy tells the 
functional relationship between the two set of LD blocks in a BPM. 

Because multiple BPMs can be discovered from a case-control dataset, corrections are 
needed for multiple hypothesis testing. Since the discovery is with respect to the genetic 
difference between cases and controls, we adopt the widely used permutation test [29] in 
which the original case-control groups are randomly shuffled over the entire set of samples in 
a dataset (100 permutations were used on each of the datasets). For each of the permuted 
case-control grouping, the same pipeline for network construction and analysis as illustrated 
in Figures [1] and [3] was repeated. Such permutations preserve the structure of the GWAS 
data and provide an unbiased, non-parametric correction for multiple hypothesis testing. 
After the 100 permutations, empirical false discovery rates (FDRs) (as used in [29]) are 
computed for the following six BPM measures: Ssmm Smax, Dxy, Fxx, Fxy and Fyy. Note 

''Note that, here we use average to summarize the functional similarity between one set of LD blocks with 
another set of LD blocks (a BPM), where we expect a BPM to have strongly supporting functional relevance, 
thus we use average to enforce the relevance is not just from few block pairs. This is different from the max 
function used to summarize the functional similarity between two LD blocks based on the genes that they 
are assigned with. 
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Table 2: Number and sizes of the (significant) BPMs discovered from each dataset 

that, Dxx, Dyy are only used for filtering out uninteresting BPMs (threshold used: 0.35), 
and we do not compute FDRs for them. As for Ssum, Smax and Smin, we observe that they 
generally do not produce significant FDRs, and thus, it is not discussed in rest of this section 
but available on the supplementary website. 

Two BPMs may overlap with each other on either of the two constituent LD block sets 
{X and Y). Therefore, from the set of BPMs discovered from the real (or each permuted 
case-control grouping), we first select a subset of BPMs, by going through all the BPMs in 
decreasing order of |X| and add a BPM to the subset if it has less than 50% overlap on 
either the two sets {X or Y) with all the BPMs that have been selected (initially empty). 
Note that, for the estimation of FDR for a BPM of size \X\ by \Y\ for one of the four 
measures {Dxy, Fxx, Fxy and Fyy), the null distribution is only estimated from those 
random BPMs (discovered with permuted case-control groupings) that have sizes greater 
or equal to both Smax and Smin- Such a size-specific estimation removes the bias of size 
difference that different BPMs may cause. 

Table |2] shows the details of the number of BPMs discovered from each of the case-control 
datasets and the number of significant BPMs by each of the four measures (FDR < 0.25). 

Several observations can be made from Table [2J 

1. Statistical significance of the discovered BPMs: Many BPMs are significant 
with respect to the Dxy on the the density of genetic interactions across the two sets. 
Specifically, there are significant BPMs discovered with respect to Dxy on all the six 
datasets. This indicates that the existence of genetically buffered functional modules as 
captured by the BPM structure is evident in genome-wide case-control SNP datasets. 
These BPMs generally have larger sizes and interaction densities than the random 
BPMs discovered from permutation tests. 

2. Biological significance of the discovered BPMs: Many BPMs are significant 
with respect to the last three measures, i.e. Fxx, Fyy and Fxy, which suggests 
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that they are not only statistically significant, but are also supported by independent 
genomic/proteomic evidence. More specifically, the density of edges in the functional 
network, within both compensatory modules {Fxx, ^yy) as well as between them 
{Fxy), is frequently higher for the BPMs derived from the real data as compared to 
the random permutations of case-control groupings. 

While Table |2]gives an overall summary of the BPMs discovered from each of the datasets. 
Figure [5] shows two illustrative examples of BPMs discovered from Parkinson and Myeloma- 
survival. Many BPMs from the other four datasets are available on the supplementary 
website. 

Several observations can be made from Figure O We will use the BPM discovered from 
Parkinson as a running illustrative example of their interpretation. 

1. The Statistical significance of the BPM discovered from Parkinson: The 

table on the left shows the FDRs for each of the six BPM measures. As shown, the 
six FDRs are all below 0.15 with several below 0.05 (the latter four). A Dxy of 0.73 
(FDR < 0.01) shows the dense genetic interactions between the two sets of LD blocks 
in this BPM. An Fyy of 0.98 (FDR 0.05) indicates that the right set of LD blocks 
have strong LD block-block functional similarity, which agrees the concept of BPM, 
i.e. each side of the BPM is likely to be involved in a common pathway or a process. 
An Fxy of 0.70 (FDR 0.03) suggests that the two sets of LD blocks in this BPM 
may control two functions respectively, which may have compensatory function under 
genetic variations. 

2. Most SNPs in the BPM have insignificant individual association: The his- 
tograms of individual SNP association {—logio^x^ — test p- value)) indicate that, almost 
all of the SNPs in the significant BPM have insignificant individual p- value (below the 
green dashed line which corresponds to — /o5fio(0.05) even before Bonferroni correc- 
tion). This supports the utility of the proposed framework for discovering significant 
BPMs that contain SNPs with weak individual association, which would be mostly 
ignored by existing approaches. 

3. All cross-set SNP pairs in the BPM have insignificant genetic interaction 
when considered separately: We also compute FDR for each SNP pair in a BPM 
across the two sets of LD blocks. For the two examples shown in the figure, the lowest 
FDRs of individual SNP pairs are 0.45 and 0.75 in the Parkinson example and the 
Myeloma- survival example respectively. This indicates that none of the SNP pairs in 
the two datasets would be considered as significant epistasis if they are considered in an 
isolated manner. In contrast, discovering them as a BPM with the proposed approach 
yields their significant FDRs. This demonstrates the effectiveness of the proposed 
functional analysis for genome-wide case-control SNP datasets. 
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4. The two sets in the BPM contain LD blocks either from different chromo- 
somes or from the same chromosome but with different genotype profiles: 

The two heat maps of all pair's hamming similarities for the all the SNPs (grouped 
by LD blocks) indicate that the SNPs within each LD block have correlated genotype 
profiles while those from different LD blocks have different genotype profiles. This 
indicates that the BPM is not a trivial biclique that is due to LD structure. Note that, 
this is an illustration of the effectiveness and necessity of constructing and analyzing 
a block-level genetic interaction network instead of either a SNP-level network or a 
gene-level network as discussed in the methods section. The diversity of chromosomes 
in both sets of a BPM indicates that the two compensating functions (e.g. pathways) 
are from different chromosomes, showing the complexity of the mechanisms underlying 
the disease phenotype. 

An illustrative interpretation on the BPM discovered on the Parkinson's 
dataset: Given the statistical significance of the BPM structures discovered across many 
of the GWAS studies, we further asked whether the genes on both sides were enriched for 
known pathways or processes using gene sets defined by Gene Ontology terms or MSigDB. 
Consistent with their overlap with the functional network, a number of the modules involved 
in BPMs did show significant enrichment (see Figure |5] for examples on the Parkinson and 
Myeloma- survival datasets or the supplementary website for a complete list). The BPM 
shown in Figure [5] that is associated with Parkinson's disease was a pair of modules, one 
enriched for ion channel activity and the other enriched for protein kinase activity. The 
ion channel activity enrichment is driven by three genes ACCNl, KCNEl, TRPM3, each 
of which comes from a separate LD block on three different chromosomes (chromosomes 
9, 17, and 21, respectively). This is potentially interesting as potassium channels were re- 
cently suggested as a possible new target for therapeutics [33]. It was hypothesized that 
such ion channels may affect the progressive loss of dopamine neurons, which is the main 
cause of Parkinson's disease. It is also interesting to note that mouse knock-out mutants 
of KCNEl, a potassium voltage-gated channel protein, have been associated with the so- 
called shaker /waltzer phenotype, which is characterized by rapid bilateral circling during 
locomotion [TO] . 

The complementary module in this Parkinson's disease BPM was enriched for protein ki- 
nase activity due to the presence of the two protein kinases MERTK (c-mer proto-oncogene 
tyrosine kinase) and EIF2AK3 (eukaryotic translation initiation factor 2-alpha kinase 3), 
suggesting that combined mutations affecting ion channel activity and one of these signaling 
pathways may be causal determinants of Parkinson's. While the specific link is unclear, it 
is interesting to note that EIF2AK3 is one of the key regulators of the eukaryotic transla- 
tion initiation factor and thus, it controls global rates of protein synthesis in the cell 0. It is 
certainly conceivable that mutations in such a protein with relatively global infiuence on pro- 
tein levels could modify, and in this case aggravate, the effects of other mutations. In fact, 

®http: / /www. genecards.org/cgi-bin/ carddisp.pl?gene=EIF2AK3 
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mutations in another translation initiation factor, EIF2B, were recently associated with 
Vanishing White Matter disease, a disorder that causes rapid deterioration of the central 
nervous system [22]. Given the large number of genes involved in the LD blocks associated 
with each BPM, identifying the genes functionally responsible could be quite difficult and 
is one of the main caveats of this type of analysis. However, this process can be aided by 
simple enrichment analysis, which in this case appears to implicate processes whose link to 
Parkinson's disease seems plausible. 

4 Discussion 

In this paper, we target the construction and functional analysis of disease-specific human 
genetic interaction networks from genome-wide association data designed for case-control 
studies on complex diseases. Specifically, we focused on genome-wide case-control SNP 
data, which has its linkage disequilibrium (LD) structure. We discussed the challenges in 
the detection of genetic interactions due to LD structure and propose a general approach with 
three steps: (1) estimating SNP-SNP genetic interactions, (2) identifying genome segments in 
linkage disequilibrium (LD) and mapping of SNP interactions to LD block-block interactions, 
and (3) mapping for LD blocks to genes. 

We performed two sets of functional analyses on six case-control SNP datasets to study if 
the constructed human genetic interaction network has functional significance supported by 
independent biological evidence by comparing with a human functional networks. We also 
demonstrated how the constructed interaction network can provide high-resolution insights 
about the compensation between pathways in their joint association with a disease phenotype 
by discovering between-pathway models. 

Comprehensive experimental results on six case-control datasets demonstrated that (i) 
From the perspective of genetic interaction analysis, the constructed human genetic inter- 
action network has functional significance, and that biologically interesting motifs such as 
BPM that are common in lower eukaryotes also exist in the genetic interaction network 
discovered from human genetic variations associated with complex diseases such as cancers 
and Parkinson's disease; (ii) From the perspective of GWA data analysis, discovering BPMs 
from the constructed human genetic interaction network can help reveal novel biological 
insights about complex diseases, beyond existing approaches for GWAS data analysis that 
either ignore interactions between SNPs, or model different SNP-SNP genetic interactions 
separately rather than studying global genetic interaction networks as done in this study. 

This paper focused on the presentation of the overall framework of constructing and 
analyzing human disease-specific genetic interaction network with GWAS data. There are a 
number of interesting and necessary directions for future work such as exploring the effect 
of different epistasis measures, the effect of different LD block identification approaches, the 
effect of different aggregation functions, and the effect of different gene mapping approaches. 

In conclusion, we want to highlight that, even though we chose to use some relatively 
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simple and conservative options in the framework which needs further exploration as dis- 
cussed above, the significant statistical and biological evidence obtained from the two sets 
of functional analyses demonstrate the effectiveness of the current framework in revealing 
several consistent observations over six case-control SNP datasets. 
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Figure 4: Functional similarity of top pairs of LD blocks in the genetic interaction network {M2) 
and genetic interaction profile similarity network {M4) on Parkinson and Myeloma-survival datasets. 
Each row shows two figures for each of the two datasets. Each subfigure on the left column shows 
three bars (with standard deviation also indicated on each), which tell the average functional simi- 
larity of three types of block pairs respectively: (1) those with high similarity of genetic interaction 
profile (top 300 edges in M4), (2) block pairs with high genetic interaction (top 300 edges in M2) 
and (3) all block pairs with functional similarity (baseline). The LD block-block pairs with zero 
functional similarity are not considered in any of the the three bars. We also include three p-values 
based on the Wilcoxon test on the functional similarities in the three groups. Each of the subfig- 
ures on the right column shows, for each pair of LD blocks, its genetic- inter action-profile similarity 
(in M4) and its functional similarity (in Mq). The cluster of high functional similarities on the 
top-right corner corresponds to the significantly higher mean in the first bar in the subfigure on 
the left column of the same row. 
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Figure 5: Two illustrative BPMs discovered from Parkinson and Myeloma- survival. For each 
BPM, we display the LD block IDs on both sides and the chromosome that each block is 
located on. In addition, the seven measures and the corresponding FDRs are also shown for 
each BPM. Further, we also list the top five enriched GO annotations or MSigDB gene sets. 
A histogram of the —Ioqiq p-values {x^ test) for all the individual SNPs in the blocks of each 
side of a BPM is also shown. A dashed line is shown in each histogram, which corresponds 
to the — /o5fio(0.05) p-value without Bonferroni correction. All the SNPs have insignificant 
individual p-value supports the highlight of the proposed framework to discover significant 
BPMs containing SNPs with weak individual association, that are mostly ignored by existing 
approaches. 
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